Line Fitting, Residuals and Correlation

Modeling Numerical Variables

In this unit we will learn to quantify the relationship between two numerical variables, and to model a numerical response variable using a numerical or categorical explanatory variable.

Poverty versus High School graduation rate

The scatterplot below shows the relationship between the HS graduation rate in all 50 US states and DC and the percent of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).

Response Variable? Percentage in poverty

Explanatory Variable? Percentage of HS graduates

Relationship? Linear, negative, moderately strong

Quantifying the Relationship

  • Correlation describes the strength of the linear association between the two variables
  • Takes on values between \(-1\) and \(+1\) (inclusive)
  • A value of \(0\) indicates no linear association

Guessing the Correlation

Which of the following is the best guess for the correlation between percentage in poverty and percentage of HS graduates?

  • 0.6
  • -0.75
  • -0.1
  • 0.02
  • -1.5

Answer: -0.75. The association is negative and fairly strong, and -1.5 is impossible since correlations always lie between \(-1\) and \(+1\).

Guessing the Correlation (2)

Which of the following is the best guess for the correlation between percentage in poverty and percentage of female householders?

  • 0.1
  • -0.6
  • -0.4
  • 0.9
  • 0.5

Assessing the Correlation

Which of the following has the strongest correlation, i.e., the correlation coefficient closest to \(-1\) or \(+1\)?

Option (b). While (a) clearly 'tracks', it's not actually linear.

Fitting a line by least squares regression

Eyeballing the line

Which of the following appears to be the line that best fits the linear relationship between percentage in poverty and percentage of HS grad? Choose one.

The best fit appears to be (a) (over (d)); (b) and (c) aren't good fits at all.

Residuals

Residuals are the leftovers from the model fit.

\[ \text{Data} = \text{Fit} + \text{Residuals} \]

Residuals (ctd.)

Residuals are the difference between the observed (\(y_i\)) and the predicted (\(\hat{y}_i\)). We label these as \(e_i\).

\[ e_i = y_i - \hat{y}_i \]

Note: the order matters! We always take observed minus predicted, never predicted minus observed.

Residuals (ctd.)

The labeled points indicate that:

  • % living in poverty in DC is 5.44% more than predicted.
  • % living in poverty in RI is 4.16% less than predicted.
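As a quick sketch (with made-up toy numbers, not the poverty data), residuals in R are just observed minus predicted values:

```r
# Toy data (hypothetical, not the poverty data set)
x <- c(80, 84, 86, 88, 92)
y <- c(15, 13, 12, 10, 7)

fit  <- lm(y ~ x)       # least squares fit
yhat <- fitted(fit)     # predicted values, y-hat
e    <- y - yhat        # residuals: observed minus predicted

# resid(fit) returns exactly the same quantities, and with an intercept
# in the model the least squares residuals sum to (numerically) zero
all.equal(as.numeric(e), as.numeric(resid(fit)))
sum(e)
```

Computing `yhat - y` instead would flip every sign, which is why the order matters.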

A measure for the best line

  • We want a line that has small residuals
    • Option 1: minimize the sum of the absolute values of the residuals, \[ |e_1| + |e_2| + \cdots + |e_n| \]
    • Option 2: minimize the sum of the squared residuals (least squares) \[ e_1^2 + e_2^2 + \cdots + e_n^2 \]
  • Why least squares?
    • Most commonly used
    • Easier to compute by hand, and with software
    • In many applications, a residual twice as large is more than twice as bad

The least squares line

Notation:

  • Intercept
    • Parameter: \(\beta_0\)
    • Point estimate: \(b_0\)
  • Slope
    • Parameter: \(\beta_1\)
    • Point estimate: \(b_1\)

Given \(\bar{x} = 86.01\), \(s_x = 3.73\), \(\bar{y} = 11.35\), \(s_y = 3.1\), and \(R = -0.75\) \(\ldots\)

Slope

The slope of the regression line (remember: \(y = mx + b\)) can be calculated as

\[ b_1 = \frac{s_y}{s_x} R \]

In context:

\[ b_1 = \frac{3.1}{3.73} \cdot (-0.75) = -0.62 \]

Interpretation: for each additional percentage point in the HS graduation rate, we would expect the percentage living in poverty to be lower on average by 0.62 percentage points.

Intercept

The intercept is where the regression line intersects the \(y\)-axis. The calculation of the intercept uses the fact that a regression line always passes through (\(\bar{x}, \bar{y}\)):

\[ b_0 = \bar{y} - b_1 \bar{x} \]

We calculate: \[ b_0 = 11.35 - (-0.62) \cdot 86.01 = 64.68 \]
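Plugging the summary statistics quoted above into these two formulas reproduces both estimates. A sketch (note that the slides round the slope to \(-0.62\) before computing the intercept):

```r
# Summary statistics from the slides
x_bar <- 86.01   # mean % HS graduates
y_bar <- 11.35   # mean % in poverty
s_x   <- 3.73    # sd of % HS graduates
s_y   <- 3.1     # sd of % in poverty
R     <- -0.75   # correlation

b1 <- (s_y / s_x) * R               # slope
b0 <- y_bar - round(b1, 2) * x_bar  # intercept, using the rounded slope

round(b1, 2)   # -0.62
round(b0, 2)   # 64.68
```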

Practice

Which of the following is the correct interpretation of the intercept?

  • For each % point increase in HS graduation rates, the % living in poverty is expected to increase on average by 64.68%.
  • For each % point decrease in HS graduation rate, the % living in poverty is expected to increase on average by 64.68%.
  • Having no HS graduates leads to 64.68% of residents living below the poverty line.
  • States with no HS graduates are expected on average to have 64.68% of residents living below the poverty line.
  • In states with no HS graduates the % living in poverty is expected to increase on average by 64.68%.

Answer: States with no HS graduates are expected on average to have 64.68% of residents living below the poverty line. (The intercept is a prediction at \(x = 0\), not a causal claim.)

More on the intercept

Since there are no states in the data set with zero HS graduates, the intercept is of no practical interest; it is also not reliable, because \(x = 0\) is far outside the range of the data.

Regression Line

Interpretation of slope and intercept

  • Intercept: when \(x=0\), \(y\) is expected to equal the intercept
  • Slope: for each unit change in \(x\), \(y\) is expected to increase/decrease on average by the value of the slope.

Note: these statements are not causal, unless the study is a randomized controlled experiment.

Prediction

  • Using the linear model to predict the value of the response variable for a given value of the explanatory variable is called prediction, and consists of plugging the value of \(x\) into the linear model equation
  • There will be some uncertainty associated with the predicted value
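For example, a sketch of predicting the poverty percentage for a hypothetical state with a 90% HS graduation rate, using the estimates from these slides:

```r
b0 <- 64.68   # intercept estimate from the slides
b1 <- -0.62   # slope estimate from the slides

# Plug x into the linear model equation: y-hat = b0 + b1 * x
predict_poverty <- function(hs_grad) b0 + b1 * hs_grad

predict_poverty(90)   # 8.88: predicted % living in poverty
```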

Extrapolation

  • Applying a model estimate to values outside of the range of the original data is called extrapolation
  • Sometimes the intercept is an extrapolation

Examples of Extrapolation

Conditions for Least Squares Lines

  • Linearity
  • Nearly normal residuals
  • Constant variability

Conditions: (1) Linearity

  • the relationships between the explanatory and response variables should be linear
  • methods for fitting a model to non-linear relationships exist, but are mostly beyond the scope of this course
  • check using a scatterplot or residual plot of the data (or both)

Anatomy of a residuals plot

Conditions: (2) Nearly normal residuals

  • the residuals should be nearly normal
  • this condition may not be satisfied when there are unusual observations that don't follow the trend of the rest of the data
  • check using a histogram and/or a normal QQ (probability) plot of the residuals

Conditions: (3) Constant variability

  • The variability of points around the least squares line should be roughly constant
  • This implies that the variability of the residuals around the \(0\) line should be roughly constant as well
  • This is given the technical name homoscedasticity
  • Check using a plot of the residuals versus the predicted values
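A sketch of these checks in R, using simulated data in place of the course data set:

```r
# Simulated data for illustration only
set.seed(42)
x <- runif(100, 70, 95)
y <- 65 - 0.6 * x + rnorm(100, sd = 2)
fit <- lm(y ~ x)

# (1) linearity and (3) constant variability: residuals vs. predicted values
plot(fitted(fit), resid(fit)); abline(h = 0, lty = 2)

# (2) nearly normal residuals: histogram and normal QQ plot
hist(resid(fit))
qqnorm(resid(fit)); qqline(resid(fit))
```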

Example: Checking Conditions

What condition is this linear model obviously violating?

  • Constant variability
  • Linear relationship
  • Normal residuals
  • No extreme outliers

\(R^2\)

  • The strength of the fit of a linear model is most commonly evaluated using \(R^2\)
  • \(R^2\) is calculated as the square of the correlation coefficient
  • It tells us what percentage of the variability in the response variable is explained by the model
  • The remainder of the variability is due either to variables not included in the model or to inherent randomness in the data
  • For the model we've been working with, \(R^2 = (-0.75)^2 = 0.56\)
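The relationship between \(R\) and \(R^2\) can be verified directly in R (simulated data, for illustration):

```r
set.seed(7)
x <- rnorm(51, mean = 86, sd = 3.7)
y <- 65 - 0.62 * x + rnorm(51, sd = 2)

r  <- cor(x, y)                      # correlation coefficient
r2 <- summary(lm(y ~ x))$r.squared   # R-squared reported by the model fit

all.equal(r^2, r2)   # identical in simple linear regression
```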

Interpretation of \(R^2\)

Which of the below is the correct interpretation of \(R = -0.75\), \(R^2 = 0.56\)?

  • 56% of the variability in the % of HS graduates among the 51 states is explained by the model.
  • 56% of the variability in the % of residents living in poverty among the 51 states is explained by the model.
  • 56% of the time % HS graduates predict % living in poverty correctly.
  • 44% of the variability in the % of residents living in poverty among the 51 states is explained by the model.

Answer: 56% of the variability in the % of residents living in poverty among the 51 states is explained by the model. (\(R^2\) is about variability in the response variable.)

Poverty versus Region (East, West)

  • Explanatory variable: region, reference level: east
  • Intercept: the estimated average poverty percentage in the eastern states is 11.17%
    • This is the value we get if we plug in \(0\) for the explanatory variable
  • Slope: the estimated average poverty percentage in western states is 0.38% higher than in eastern states
    • The estimated average poverty percentage in western states is \(11.17 + 0.38 = 11.55\%\).
    • This is the value we get if we plug in \(1\) for the explanatory variable
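With a binary explanatory variable, the fitted line simply returns two group means. A sketch with the estimates above (coding east as 0 and west as 1 is an assumption for illustration):

```r
b0 <- 11.17   # intercept: estimated average poverty % in eastern states
b1 <- 0.38    # slope: estimated west-minus-east difference

b0 + b1 * 0   # 11.17: predicted average for east (reference level)
b0 + b1 * 1   # 11.55: predicted average for west
```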

Poverty versus Region (northeast, midwest, west, south)

Which region (northeast, midwest, west or south) is the reference level?

  • northeast
  • midwest
  • west
  • south
  • cannot tell

Doing Linear Regression in R

(more to come here)
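In the meantime, a minimal sketch of fitting a least squares line with `lm()`; the data and variable names here are simulated stand-ins, not the course data set:

```r
# Simulated stand-in data
set.seed(1)
hs_grad <- rnorm(51, mean = 86, sd = 3.7)
poverty <- 64.7 - 0.62 * hs_grad + rnorm(51, sd = 2)
d <- data.frame(hs_grad, poverty)

# Fit and inspect the model
fit <- lm(poverty ~ hs_grad, data = d)
coef(fit)      # b0 (intercept) and b1 (slope)
summary(fit)   # standard errors, t statistics, p-values, R-squared
```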

Types of Outliers in Linear Regression

Types of Outliers

How do outliers influence the least squares line in this plot?

To answer this question think of where the regression line would be with and without the outlier(s). Without the outliers the regression line would be steeper, and lie closer to the larger group of observations. With the outliers the line is pulled up and away from some of the observations in the larger group.

Types of Outliers

How do outliers influence the least squares line in this plot?

Without the outlier, there is no evident relationship between \(x\) and \(y\).

Some Terminology

  • Outliers are points that lie away from the cloud of points.
  • Outliers that lie horizontally away from the center of the cloud are called high leverage points.
  • High leverage points that actually influence the slope of the regression line are called influential points.
  • In order to determine if a point is influential, visualize the regression line with and without the point. Does the slope of the line change considerably? If so, the point is influential; if not, it is not an influential point.
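That "with and without" comparison can also be done numerically: refit after dropping the suspect point and compare slopes. A sketch with a constructed outlier:

```r
# Ten points on a clear upward trend, plus one far-right, low-lying outlier
set.seed(3)
x <- c(1:10, 25)
y <- c(1:10 + rnorm(10, sd = 0.2), 2)
d <- data.frame(x, y)

slope_with    <- coef(lm(y ~ x, data = d))[["x"]]
slope_without <- coef(lm(y ~ x, data = d[-11, ]))[["x"]]

slope_with      # dragged toward zero by the influential point
slope_without   # close to 1, the trend of the main cloud
```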

Influential Points

Data are available on the log of the surface temperature and the log of the light intensity of 47 stars in the star cluster CYG OB1.

Types of Outliers

Which of the below best describes the outlier?

  1. influential
  2. high leverage
  3. none of the above
  4. there are no outliers

Types of Outliers

Does this outlier influence the slope of the regression line?

Not really …

Recap

Which of the following is true?

  1. Influential points always change the intercept of the regression line.
  2. Influential points always reduce \(R^2\).
  3. It is much more likely for a low leverage point to be influential than a high leverage point.
  4. When the data set includes an influential point, the relationship between the explanatory variable and the response variable is always nonlinear.
  5. None of the above.

Answer: None of the above.

Recap (continued)

Inference for Linear Regression

Nature or Nurture?

In 1966 Cyril Burt published a paper called "The genetic determination of differences in intelligence: A study of monozygotic twins reared apart". The data consist of IQ scores for (an assumed random sample of) 27 identical twins, one raised by foster parents, the other by the biological parents.

Practice

Which of the following is false?

                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       9.20760    9.29990   0.990    0.332    
bioIQ             0.90144    0.09633   9.358  1.2e-09

Residual standard error: 7.729 on 25 degrees of freedom
Multiple R-squared: 0.7779, Adjusted R-squared: 0.769 
F-statistic: 87.56 on 1 and 25 DF,  p-value: 1.204e-09 

  1. An additional 10 points in the biological twin's IQ is associated with additional 9 points in the foster twin's IQ, on average.
  2. Roughly 78% of the foster twins' IQs can be accurately predicted by the model.
  3. The linear model is \(\hat{\text{fosterIQ}} = 9.2 + 0.9 \cdot \text{bioIQ}\).
  4. Foster twins with IQs higher than average IQs tend to have biological twins with higher than average IQs as well.

Answer: Statement 2 is false. \(R^2\) gives the percentage of variability in foster twins' IQs explained by the model, not the share of IQs that are "accurately predicted".

Testing for the Slope

Assuming that these 27 twins comprise a representative sample of all twins separated at birth, we would like to test if these data provide convincing evidence that the IQ of the biological twin is a significant predictor of the IQ of the foster twin. What are the appropriate hypotheses?

  1. \(H_0: b_0 = 0; H_A: b_0 \neq 0\)
  2. \(H_0: \beta_0 = 0; H_A: \beta_0 \neq 0\)
  3. \(H_0: b_1 = 0; H_A: b_1 \neq 0\)
  4. \(H_0: \beta_1 = 0; H_A: \beta_1 \neq 0\)

Answer: \(H_0: \beta_1 = 0; H_A: \beta_1 \neq 0\). Hypotheses are always about population parameters (\(\beta_1\)), not point estimates (\(b_1\)), and the question concerns the slope, not the intercept.

Testing for the Slope (ctd.)

  • We always use a \(t\)-test in inference for regression
    (remember: \(t_\text{test} = \frac{\text{point estimate} - \text{null value}}{SE}\))
  • The point estimate is \(b_1\), the observed slope
  • The standard error, \(SE_{b_1}\), is the standard error associated with the slope
  • The degrees of freedom associated with the slope are df \(= n-2\), with \(n\) the sample size
    (remember: we lose 1 degree of freedom for each parameter we estimate, and in a simple linear regression we estimate both \(\beta_0\) and \(\beta_1\))

Testing for the Slope (ctd.)

\[ \begin{split} t_\text{test} &= \frac{0.9014 - 0}{0.0963} = 9.36 \\ \text{df} &= 27 - 2 = 25 \\ p\text{-value} &= P\left( |t_\text{test}| > 9.36 \right) < 0.01 \end{split} \]
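In R, a sketch of the same computation using the numbers above:

```r
b1    <- 0.9014   # observed slope
se_b1 <- 0.0963   # its standard error

t_test <- (b1 - 0) / se_b1   # 9.36
df     <- 27 - 2             # 25

# Two-sided p-value: P(|t| > 9.36) for a t distribution with 25 df
p_value <- 2 * pt(abs(t_test), df = df, lower.tail = FALSE)
p_value   # roughly 1.2e-09, far below 0.01
```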

Percent college graduate versus percent Hispanic in Los Angeles

What can you say about the relationship between % college graduate and % Hispanic in a sample of 100 zip code areas in LA?

% college graduate versus % Hispanic in Los Angeles (another look)

What can you say about the relationship between % college graduate and % Hispanic in a sample of 100 zip code areas in LA?

% college graduate versus % Hispanic in Los Angeles (another look)

Which of the below is the best interpretation of the slope?

  1. A 1% increase in Hispanic residents in a zip code area in LA is associated with a 75% decrease in % of college grads.
  2. A 1% increase in Hispanic residents in a zip code area in LA is associated with a 0.75% decrease in % of college grads.
  3. An additional 1% of Hispanic residents decreases the % of college graduates in a zip code area in LA by 0.75%.
  4. In zip code areas with no Hispanic residents, % of college graduates is expected to be 75%.

Answer: Option 2. It describes an association of the right magnitude without making a causal claim.

% college graduate versus % Hispanic in Los Angeles (another look)

Do these data provide convincing evidence that there is a statistically significant relationship between % Hispanic and % college graduates in zip code areas in LA?

Yes. Since the p-value for % Hispanic is low, the data provide convincing evidence that the slope parameter is different from 0.

How reliable is this p-value if these zip code areas are not randomly selected?

Not very …

Confidence Interval for the Slope (back to the Twins example)

Remember that a confidence interval is calculated as point estimate \(\pm\) ME and the degrees of freedom associated with the slope in a simple linear regression is \(n - 2\). Which of the below is the correct 95% confidence interval for the slope parameter? Note that the model is based on observations from 27 twins.

  1. 9.2076 \(\pm\) 1.65 \(\cdot\) 9.2999
  2. 0.9014 \(\pm\) 2.06 \(\cdot\) 0.0963
  3. 0.9014 \(\pm\) 1.96 \(\cdot\) 0.0963
  4. 9.2076 \(\pm\) 1.96 \(\cdot\) 0.0963

Confidence Interval for the Slope (back to the Twins example)

\[ \begin{split} n &= 27 \\ \text{df} &= 27 - 2 = 25 \end{split} \]

Confidence Interval for the Slope (back to the Twins example)

qt(p = 0.975, df = 25, lower.tail = TRUE)
## [1] 2.059539

\[ \begin{split} n &= 27 \\ \text{df} &= 27 - 2 = 25\\ t_{25}^* &= 2.06 \end{split} \]
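Putting the pieces together in R:

```r
b1     <- 0.9014               # slope estimate
se_b1  <- 0.0963               # its standard error
t_star <- qt(0.975, df = 25)   # critical value, about 2.06

# 95% confidence interval: point estimate +/- t* . SE
ci <- b1 + c(-1, 1) * t_star * se_b1
round(ci, 2)   # 0.70 1.10
```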

Confidence Interval for the Slope (back to the Twins example)

The correct interval is \(0.9014 \pm 2.06 \cdot 0.0963\): the point estimate is the slope \(b_1 = 0.9014\), its standard error is \(0.0963\), and the critical value is \(t_{25}^* = 2.06\).

Recap

  • Inference for the slope for a single-predictor linear regression model:
    • Hypothesis test:
      \[ t_\text{test} = \frac{b_1 - \text{null value}}{SE_{b_1}} \qquad \text{df} = n-2 \]
    • Confidence interval:
      \[ b_1 \pm t_{\text{df} = n-2}^* \cdot SE_{b_1} \]
  • The null value is often \(0\) since we are usually checking for any relationship between the explanatory and response variables
  • The regression output gives \(b_1\), \(SE_{b_1}\), and the two-tailed p-value for the \(t\)-test for the slope with a null value of \(0\)
  • We rarely do inference on the intercept, so we'll be focusing on the estimates and inference for the slope

Caution

  • Always be aware of the type of data you're working with: random sample, non-random sample, or population.
  • Statistical inference, and the resulting p-values, are meaningless when you already have population data.
  • If you have a sample that is non-random (biased), inference on the results will be unreliable.
  • The ultimate goal is to have independent observations.